Using Wikipedia to Collect a Corpus for Automatic Definition Extraction: Comparing English and Portuguese Languages
Abstract
Systems for the detection and extraction of definitions are being developed for different purposes, such as glossary creation [5, 3], lexical databases [6], ontologies [2], question answering [1], etc. All these systems use annotated corpora to build a set of rules or patterns capable of identifying a definition in unseen text. The basic structure of a definition resembles an equation, with the definiendum (what is to be defined) on the left-hand side and the definiens (the part doing the defining) on the right-hand side. Between the defined term and its description there is a connector, usually a verb or a punctuation symbol. In general, work in this field is restricted in the number and types of definitions considered: it is based on small, highly domain-specific corpora and lacks a general approach. This limitation is due to the scarcity of corpora previously annotated with definition information, as such corpora are not usually available and the annotation process is very expensive. In this work we propose to use Wikipedia as a corpus from which to extract general-domain definitions, which can bootstrap the construction of an automatic definition extractor. The corpus can be used to derive patterns or to extract lexical information characterizing definitions. The convenience of using Wikipedia as a source of definitions rests on the peculiar structure of its articles, which follow well-defined rules that Wikipedia itself asks contributors to observe when writing an article. In particular, Wikipedia states that the first paragraph of each article should define the topic of the article. In this paper, we focus on the issues arising when extracting a general balanced corpus composed of Wikipedia articles, and on the size of such a corpus. We present a study using two different languages, Portuguese and English, two different algorithms, and corpora of five different sizes.
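To make the described pipeline concrete, the following is a minimal sketch (not the authors' implementation) of how the first paragraph of a Wikipedia article can be retrieved and matched against a single copular definition pattern. It assumes the public MediaWiki API of the English Wikipedia; the regular expression, function names, and example title are illustrative assumptions only.

```python
import re
import requests

# Public MediaWiki API endpoint (assumption: English Wikipedia).
API_URL = "https://en.wikipedia.org/w/api.php"

def fetch_lead_paragraph(title):
    """Fetch the plain-text introduction of an article, i.e. the part
    that Wikipedia's guidelines say should define the article's topic."""
    params = {
        "action": "query",
        "prop": "extracts",
        "exintro": 1,       # restrict to the text before the first heading
        "explaintext": 1,   # strip wiki markup
        "format": "json",
        "titles": title,
    }
    response = requests.get(API_URL, params=params, timeout=10).json()
    pages = response["query"]["pages"]
    return next(iter(pages.values())).get("extract", "")

# One illustrative copular pattern: "<definiendum> is/are a/an/the <definiens>."
# A real extractor needs many more patterns and connectors than this.
DEFINITION_PATTERN = re.compile(
    r"^(?P<definiendum>[^.,;]+?)\s+"
    r"(?P<connector>is|are|was|were)\s+"
    r"(?:a|an|the)\s+"
    r"(?P<definiens>[^.]+)\."
)

def extract_definition(lead):
    """Split the first sentence of a lead paragraph into its three parts."""
    first_line = lead.strip().split("\n")[0]
    match = DEFINITION_PATTERN.match(first_line)
    return match.groupdict() if match else None

if __name__ == "__main__":
    lead = fetch_lead_paragraph("Ontology")  # hypothetical example title
    print(extract_definition(lead))
```

A full system would iterate over article dumps rather than the live API and would cover many more connectors (verbs and punctuation); the sketch only illustrates the definiendum-connector-definiens decomposition described above.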
Similar papers
Comparing k-means clusters on parallel Persian-English corpus
This paper compares clusters of aligned Persian and English texts obtained with the k-means method. Text clustering has many applications in various fields of natural language processing. So far, much research on clustering English documents has been carried out. The question now arises: are those results extendable to other languages? Since the goal of document clustering is grouping of docum...
The Presence and Influence of English in the Portuguese Financial Media
As the lingua franca of the 21st century, English has become the main language of intercultural communication for those wanting to embrace globalization. In Portugal, it is the second language of most public and private domains, influencing its culture and discourses. Language contact situations transform languages through the incorporations they make from other languages, and Portugal has...
Classifying articles in English and German Wikipedia
Named Entity (NE) information is critical for Information Extraction (IE) tasks. However, the cost of manually annotating sufficient data for training purposes, especially for multiple languages, is prohibitive, meaning automated methods for developing resources are crucial. We investigate the automatic generation of NE annotated data in German from Wikipedia. By incorporating structural featur...
Acquisition of Medical Terminology for Ukrainian from Parallel Corpora and Wikipedia
The increasing availability of parallel bilingual corpora and of automatic methods and tools for their processing makes it possible to build linguistic and terminological resources for low-resourced languages. We propose to exploit various corpora available in several languages in order to build bilingual and trilingual terminologies. Typically, terminology information extracted in French and E...
Measuring Comparability of Multilingual Corpora Extracted from Wikipedia
Comparable corpora can be used for many linguistic tasks such as bilingual lexicon extraction. By improving the quality of comparable corpora, we improve the quality of the extraction. This article describes some strategies to build comparable corpora from Wikipedia and proposes a measure of comparability. Experiments were performed on Portuguese, Spanish, and English Wikipedia.
Publication year: 2012